
Support Vector Machine


Introduction


The Support Vector Machine (SVM) is a linear classifier that can be viewed as an extension of the Perceptron developed by Rosenblatt in 1958. The Perceptron guarantees that you find a separating hyperplane if one exists; the SVM finds the separating hyperplane with the maximum geometric margin.

Setting

We define a linear classifier $h(\mathbf{x})=\mathrm{sign}(\mathbf{w}^T\mathbf{x}+b)$ and we assume a binary classification setting with labels $\{+1, -1\}$.
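As a quick illustration, here is a minimal sketch of this classifier in Python; the weight vector, bias, and data points below are made-up values chosen only for the example.

```python
import numpy as np

def predict(X, w, b):
    """Linear classifier h(x) = sign(w^T x + b), returning labels in {+1, -1}."""
    scores = X @ w + b
    return np.where(scores >= 0, 1, -1)

# Toy example with a hand-picked hyperplane (illustrative values only).
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[2.0, 0.0], [-1.0, 3.0]])
print(predict(X, w, b))  # [ 1 -1]
```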

The left figure shows two different separating hyperplanes for the same dataset. The right figure shows the maximum margin hyperplane. The margin $\gamma$ is the distance from the hyperplane (solid line) to the closest points in either class (which touch the parallel dotted lines).

Therefore, given a linearly separable dataset $\{(\mathbf{x}_i,y_i)\}^n_{i=1}$, the minimum geometric margin is defined as

$$\gamma(\mathbf{w}, b) := \min_{\mathbf{x}_i \in D} \frac{\left|\mathbf{w}^T \mathbf{x}_i+b\right|}{\|\mathbf{w}\|_2}$$

Our goal is to find $(\mathbf{w},b)$ such that the hyperplane separates the data and maximizes $\gamma(\mathbf{w},b)$:

$$\max_{\mathbf{w},b}\frac{1}{\|\mathbf{w}\|_2}\min_{\mathbf{x}_i} \left|\mathbf{w}^T \mathbf{x}_i+b\right|$$

such that

$$y_i(\mathbf{w}^T \mathbf{x}_i+b)\ge 0, \quad \forall i$$

Margin

We already saw the definition of a margin in the context of the Perceptron.

Now let's take a look at this picture.

A hyperplane is defined through $\mathbf{w}, b$ as the set of points $\mathcal{H}=\left\{\mathbf{x} \mid \mathbf{w}^T \mathbf{x}+b=0\right\}$. Let the margin $\gamma$ be defined as the distance from the hyperplane to the closest point across both classes. What is the distance of a point $\mathbf{x}$ to the hyperplane $\mathcal{H}$?

Consider some point $\mathbf{x}$. Let $\mathbf{d}$ be the vector from $\mathcal{H}$ to $\mathbf{x}$ of minimum length, and let $\mathbf{x}^P$ be the projection of $\mathbf{x}$ onto $\mathcal{H}$. It then follows that:

$$\mathbf{x}^P=\mathbf{x}-\mathbf{d}$$

$\mathbf{d}$ is parallel to $\mathbf{w}$, so $\mathbf{d}=\alpha \mathbf{w}$ for some $\alpha \in \mathbb{R}$. Furthermore, $\mathbf{x}^P \in \mathcal{H}$, which implies $\mathbf{w}^T \mathbf{x}^P+b=0$. Therefore,

$$\mathbf{w}^T \mathbf{x}^P+b=\mathbf{w}^T(\mathbf{x}-\mathbf{d})+b=\mathbf{w}^T(\mathbf{x}-\alpha \mathbf{w})+b=0$$

which implies $\alpha=\frac{\mathbf{w}^T \mathbf{x}+b}{\mathbf{w}^T \mathbf{w}}$.

The length of $\mathbf{d}$ is then

$$\|\mathbf{d}\|_2=\sqrt{\mathbf{d}^T \mathbf{d}}=\sqrt{\alpha^2 \mathbf{w}^T \mathbf{w}}=\frac{\left|\mathbf{w}^T \mathbf{x}+b\right|}{\sqrt{\mathbf{w}^T \mathbf{w}}}=\frac{\left|\mathbf{w}^T \mathbf{x}+b\right|}{\|\mathbf{w}\|_2}$$
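To make the formula concrete, here is a small sketch (with made-up example values) that computes the distance of each point to a hyperplane and the resulting margin of a dataset:

```python
import numpy as np

def distances_to_hyperplane(X, w, b):
    """Distance |w^T x + b| / ||w||_2 of each row of X to the hyperplane."""
    return np.abs(X @ w + b) / np.linalg.norm(w)

def margin(X, w, b):
    """gamma(w, b): distance of the closest point to the hyperplane."""
    return distances_to_hyperplane(X, w, b).min()

# Illustrative values only.
w = np.array([3.0, 4.0])          # ||w||_2 = 5
b = -5.0
X = np.array([[1.0, 3.0], [0.0, 0.0], [2.0, 2.0]])
print(distances_to_hyperplane(X, w, b))  # [2.  1.  1.8]
print(margin(X, w, b))                   # 1.0
```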

Max Margin Classifier

Margin of $\mathcal{H}$ with respect to $D$: $\gamma(\mathbf{w}, b)=\min_{\mathbf{x} \in D} \frac{\left|\mathbf{w}^T \mathbf{x}+b\right|}{\|\mathbf{w}\|_2}$.

By definition, the margin and hyperplane are scale invariant: $\gamma(\beta \mathbf{w}, \beta b)=\gamma(\mathbf{w}, b)$ for all $\beta \neq 0$.
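This scale invariance is easy to verify numerically; the following check reuses the hypothetical `margin` helper and the example data from the sketch above:

```python
# Rescaling (w, b) by any beta != 0 leaves the margin unchanged,
# because |w^T x + b| and ||w||_2 both pick up the same factor |beta|.
beta = 7.3
print(np.isclose(margin(X, beta * w, beta * b), margin(X, w, b)))  # True
```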

Note that if the hyperplane is such that $\gamma$ is maximized, it must lie right in the middle of the two classes. In other words, $\gamma$ must be the distance to the closest point within both classes. (If not, you could move the hyperplane towards the data points of the class that is further away and increase $\gamma$, which contradicts that $\gamma$ is maximized.)

We can formulate our search for the maximum margin separating hyperplane as a constrained optimization problem. The objective is to maximize the margin under the constraints that all data points must lie on the correct side of the hyperplane:

$$\underbrace{\max_{\mathbf{w}, b} \gamma(\mathbf{w}, b)}_{\text{maximize margin}} \quad \text{such that} \quad \underbrace{\forall i \; y_i\left(\mathbf{w}^T \mathbf{x}_i+b\right) \geq 0}_{\text{separating hyperplane}}$$

If we plug in the definition of $\gamma$ we obtain:

$$\underbrace{\max_{\mathbf{w},b}\frac{1}{\|\mathbf{w}\|_2}\min_{\mathbf{x}_i} \left|\mathbf{w}^T \mathbf{x}_i+b\right|}_{\text{maximize margin}} \quad \text{such that} \quad \underbrace{\forall i \; y_i\left(\mathbf{w}^T \mathbf{x}_i+b\right) \geq 0}_{\text{separating hyperplane}}$$

Because the hyperplane is scale invariant, we can fix the scale of $\mathbf{w}, b$ any way we want. Let's be clever about it, and choose it such that

$$\min_{\mathbf{x} \in D}\left|\mathbf{w}^T \mathbf{x}+b\right|=1$$

We can add this re-scaling as an equality constraint. Then our objective becomes:

$$\max_{\mathbf{w}, b} \frac{1}{\|\mathbf{w}\|_2} \cdot 1=\min_{\mathbf{w}, b}\|\mathbf{w}\|_2=\min_{\mathbf{w}, b} \mathbf{w}^{\top} \mathbf{w}$$

where we made use of the fact that $f(z)=z^2$ is a monotonically increasing function for $z\ge 0$ and $\|\mathbf{w}\|_2\ge 0$; i.e., the $\mathbf{w}$ that minimizes $\|\mathbf{w}\|_2$ also minimizes $\mathbf{w}^T\mathbf{w}$.
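To see that this scale can always be achieved, note that any $(\mathbf{w}, b)$ can simply be divided by $\min_i\left|\mathbf{w}^T \mathbf{x}_i+b\right|$. A quick numerical check, reusing the hypothetical example data and `margin` helper from the earlier sketch:

```python
# Rescale (w, b) so that the closest point satisfies |w^T x + b| = 1.
scale = np.abs(X @ w + b).min()
w_hat, b_hat = w / scale, b / scale
print(np.abs(X @ w_hat + b_hat).min())                        # 1.0
print(np.isclose(margin(X, w_hat, b_hat), margin(X, w, b)))   # margin unchanged: True
```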

Therefore, the new optimization becomes:

$$\min_{\mathbf{w},b}\mathbf{w}^T\mathbf{w}$$

such that

$$y_i(\mathbf{w}^T \mathbf{x}_i+b)\ge 0, \quad \forall i$$
$$\min_{\mathbf{x}_i \in D}\left|\mathbf{w}^T \mathbf{x}_i+b\right|=1$$

In order to make the constraints easier to handle, we can turn the above optimization into:

$$\min_{\mathbf{w},b}\mathbf{w}^T\mathbf{w}$$

such that

$$y_i(\mathbf{w}^T \mathbf{x}_i+b)\ge 1, \quad \forall i$$

This formulation is equivalent to the one above: at its optimum we must have $\min_i y_i(\mathbf{w}^T \mathbf{x}_i+b)=1$ (otherwise we could rescale $\mathbf{w}, b$ to shrink $\mathbf{w}^T\mathbf{w}$ further), so both constraints of the previous problem are satisfied.
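The hard-margin problem above is a quadratic program, so any off-the-shelf QP solver can handle it. As an illustration (my own sketch, not part of the original notes), here is one way to solve it with the cvxpy library on a hypothetical linearly separable toy dataset:

```python
import cvxpy as cp
import numpy as np

# Hypothetical linearly separable toy data (illustrative values only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# min w^T w   s.t.   y_i (w^T x_i + b) >= 1 for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()

print("w =", w.value, "b =", b.value)
print("margin =", 1 / np.linalg.norm(w.value))
```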


Support Vectors

For the optimal pair $(\mathbf{w}, b)$, some training points will have tight constraints, i.e.

$$y_i(\mathbf{w}^T \mathbf{x}_i+b)= 1$$

We refer to these training points as support vectors. Support vectors are special because they are the training points that define the maximum margin of the hyperplane to the data set and they therefore determine the shape of the hyperplane. If you were to move one of them and retrain the SVM, the resulting hyperplane would change. The opposite is the case for non-support vectors (provided you don't move them too much, or they would turn into support vectors themselves). This will become particularly important in the dual formulation for Kernel-SVMs.
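Given a solution $(\mathbf{w}, b)$, the support vectors can be read off as the points whose constraints are (numerically) tight. A small sketch, assuming `X`, `y`, `w`, and `b` come from a solved hard-margin problem such as the cvxpy example above:

```python
# Support vectors are the training points with y_i (w^T x_i + b) == 1
# (up to numerical tolerance of the solver).
activations = y * (X @ w.value + b.value)
support_mask = np.isclose(activations, 1.0, atol=1e-4)
print("support vectors:\n", X[support_mask])
```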


SVM with Soft Constraints


If the data is low dimensional, it is often the case that there is no separating hyperplane between the two classes. In this case, there is no solution to the optimization problems stated above. We can fix this by allowing the constraints to be violated ever so slightly with the introduction of slack variables:

$$\begin{gathered} \min_{\mathbf{w}, b} \ \mathbf{w}^T \mathbf{w}+C \sum_{i=1}^n \xi_i \\ \text{s.t. } \ \forall i \; y_i\left(\mathbf{w}^T \mathbf{x}_i+b\right) \geq 1-\xi_i \\ \forall i \; \xi_i \geq 0 \end{gathered}$$

The slack variable $\xi_i$ allows the input $\mathbf{x}_i$ to be closer to the hyperplane (or even be on the wrong side), but there is a penalty in the objective function for such "slack". If $C$ is very large, the SVM becomes very strict and tries to get all points to be on the right side of the hyperplane. If $C$ is very small, the SVM becomes very loose and may "sacrifice" some points to obtain a simpler solution.
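In practice, the soft-margin linear SVM is usually trained with an existing library. As an illustration (my own sketch, not from the lecture notes), scikit-learn's `SVC` with a linear kernel implements this $C$-parameterized formulation, and sweeping $C$ shows the strict-versus-loose trade-off:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping clusters, so some slack is needed.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: train accuracy={clf.score(X, y):.2f}, "
          f"#support vectors={len(clf.support_)}")
```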


Unconstrained Formulation

Let us consider the value of $\xi_i$ for the case of $C \neq 0$. Because the objective will always try to make $\xi_i$ as small as possible, at the optimum each $\xi_i$ is exactly as small as the constraints allow, and we have:

$$\xi_i=\left\{\begin{array}{cl} 1-y_i\left(\mathbf{w}^T \mathbf{x}_i+b\right) & \text{if } y_i\left(\mathbf{w}^T \mathbf{x}_i+b\right)<1 \\ 0 & \text{if } y_i\left(\mathbf{w}^T \mathbf{x}_i+b\right) \geq 1 \end{array}\right.$$

This is equivalent to the following closed form:

$$\xi_i=\max\left(1-y_i\left(\mathbf{w}^T \mathbf{x}_i+b\right), 0\right)$$

If we plug this closed form into the objective of our SVM optimization problem, we obtain the following unconstrained version as loss function and regularizer:

$$\min_{\mathbf{w}, b} \underbrace{\mathbf{w}^T \mathbf{w}}_{l_2\text{-regularizer}}+C \sum_{i=1}^n \underbrace{\max\left[1-y_i\left(\mathbf{w}^T \mathbf{x}_i+b\right), 0\right]}_{\text{hinge loss}}$$

This formulation allows us to optimize the SVM parameters $(\mathbf{w}, b)$ just like logistic regression (e.g. through gradient descent). The only difference is that we have the hinge loss instead of the logistic loss.
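For concreteness, here is a minimal (sub)gradient descent sketch of this unconstrained objective in NumPy; the step size, iteration count, and toy data are arbitrary choices for illustration only:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Minimize w^T w + C * sum_i max(1 - y_i (w^T x_i + b), 0) by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        violated = margins < 1                     # points with non-zero hinge loss
        grad_w = 2 * w - C * (y[violated, None] * X[violated]).sum(axis=0)
        grad_b = -C * y[violated].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data: two roughly separable clusters with labels in {+1, -1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, size=(50, 2)), rng.normal(-2, 1, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)
w, b = train_linear_svm(X, y, C=1.0)
print("train accuracy:", np.mean(np.sign(X @ w + b) == y))
```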

The five plots above show the resulting decision boundaries and the optimal hyperplane separating the example data for $C=0.01, 0.1, 1, 10, 100$.


Reference

Weinberger, K. (n.d.). Support Vector Machine. CS4780 Lecture Note 09. https://www.cs.cornell.edu/courses/cs4780/2022fa/lectures/lecturenote09.html

Acknowledgement

Parts of the content on this website have been sourced from the following website:

https://www.cs.cornell.edu/courses/cs4780/2022fa/lectures/lecturenote09.html

written by Professor Kilian Weinberger from Cornell University. We would like to express our gratitude to the original authors while respecting their copyright and intellectual property rights. All credits go to Professor Kilian Weinberger.